CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION 
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Abstract. In this paper we present a variant of the conjugate gradient (CG) algorithm in which 
we invoke a subspace minimization subproblem on each iteration. We call this algorithm CGSO 
for "conjugate gradient with subspace optimization". It is related to earlier work by Ncmirovsky 
and Yudin. We apply the algorithm to solve unconstrained strictly convex problems. As with other 
^SJ , CG algorithms, the update step on each iteration is a linear combination of the last gradient and 

last update. Unlike some other conjugate gradient methods, our algorithm attains a theoretical 

■ complexity bound of 0{\/L/l log(l/£)), where the ratio L/l characterizes the strong convexity of the 

■ objective function. In practice, CGSO competes with other CG-type algorithms by incorporating 



some second order information in each iteration 
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1. Introduction. The method of conjugate gradients (CG) was introduced by 
Hestenes and Stiefel in 1952 [5] to minimize quadratic objective functions of the form 
/(x) = x*Ax/2 — b*x (or equivalently, to solve Ax. ^ b), where A is a symmetric pos- 
(~| . itive definite matrix. We refer to this algorithm as "hncar conjugate gradient." Later, 

the algorithm was generalized to "nonlinear conjugate gradient" which addresses un- 
constrained minimization of arbitrary differcntiablc objective functions, by Fletcher 
and Reeves |3] and Polak and Ribiere [T3], and others. Nonlinear CG algorithms all 
have the property that when applied to a quadratic objective function and coupled 
with an exact line search, they reduce to linear CG. None of these algorithms, how- 
ever, has a known complexity bound when applied to nonquadratic functions. Indeed, 
Nemirovsky and Yudin [5] argue that their worst-case complexity for strictly convex 
. functions is quite poor. 

Nemirovsky and Yudin, on the other hand, found a different generalization of 
. CG. Actually, their algorithm, which we refer to as NYCG, is not a generaliza- 

■ tion of the CG algorithm itself but rather of its derivation from some equations 

■ and inequalities that underlie it. Their algorithm achieves a worst-case complex- 
ity bound of 0{\n{l/e)yj L/l). Here e is the desired relative accuracy, that is, e = 
{fiX^) — /(x*))/(/(x'') - /(x*)), where x° is the starting point, x* is the optimizer, 
and x" is the final iterate, and L and I characterize the convexity of /. In particular, 
we assume that for all x, y lying in the level set of x*^. 
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||V/(x)-V/(y)||<L||x-y||, (1.1) 
/(y) - /(x) > (V/(x),y - x) + ^||y - xf . (1.2) 

For example, in the case of a convex quadratic function /(x) = x*v4x/2 — b*x, L/l is 
the condition number of A. From inequality (jl.ip . it follows that 

/(y) - /(x) < (V/(x), y - x) + ^||y - xf , (1.3) 

which will be useful in our analysis. 

One drawback of the NYCG algorithm is that in the case of quadratic objective 
functions, it does not reduce to linear CG (and in fact, is much slower in practice). A 
second drawback is that it requires prior knowledge of the ratio L/l. This is because 
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the algorithm needs to be restarted every L/l) iterations in order to achieve the 
above-mentioned complexity bound. 

A further disadvantage of NYCG is that it has a relatively expensive computation 
on every iteration, namely, one must solve a two-dimensional convex optimization 
problem using the ellipsoid method. This drawback was later remedied by a different 
algorithm due to Nesterov [10]. Nesterov's algorithm generates two sequences of 
iterates; the first sequence of iterates is generated such that a sufficient reduction 
in objective function is achieved, and the new iterate in the second sequence is an 
affinc combination of the last two iterates of the first sequence. Nesterov's algorithm, 
however, still has the two disadvantages mentioned in the previous paragraph. 

The purpose of this paper is to develop an algorithm that we call conjugate 
gradient with subspace optimization (CGSO) that mainly avoids the disadvantages 
mentioned above. CGSO is mostly closely related to NYCG. In particular, we seek a 
conjugate-gradient-like algorithm with the following properties. 

1. The algorithm should reduce to linear CG when the objective function is 
quadratic. 

2. The algorithm should have the complexity bound of 0{\a{l/ e)^J L/l) itera- 
tions for convex functions satisfying (jl.ip and (|1.2p . 

3. The algorithm should not require prior knowledge of any parameters describ- 
ing /. 

4. The cost per iteration should not be excessive. 

The algorithm we propose achieves goals 1-3, and mostly achieves goal 4. Like NYCG, 
our algorithm must solve a convex optimization subproblem on every iteration. The 
dimension of the subproblem is always at least 2 and is determined adaptively by the 
algorithm. The worst-case upper bound we are able to prove for the dimension of this 
subproblem is O(logj), where j is the number of iterations so far. However, in our 
testing, the dimension of the subproblem was 2 in almost every case; in one test case 
it reached the value 3 for a few iterations, but it never exceeded 3. Furthermore, we 
find that solving the subproblem is usually fairly efficient because we usually apply 
Newton's method, using automatic differentiation to obtain the necessary first and 
second derivatives. 

Goal 3, the condition of no prior knowledge of parameters, has the obvious benefit 
of making the algorithm easier to apply in practice. It also has a second subtler 
benefit. For some convex problems, e.g., minimization of a log-barrier function, there 
is no prior bound on L/l over the domain of the function since the derivatives tend to 
infinity at the boundaries. However, as the minimizer is approached, the bad behavior 
at the boundaries becomes irrelevant, and there is a new smaller ratio L/l relevant 
for level sets in the neighborhood of the minimizer. In this case, CGSO automatically 
adapts to the improved value of L/l. Such adaptation is possible also with NYCG, 
and Nesterov's algorithm too, but, as far as we know, the adaptation must be done 
by the user and cannot be easily automated. 

Methods reviewed in this paper are among techniques that are generally referred 
to as "first-order algorithms" , because they use only the first derivative information 
of the function in each iteration. Due to the successful theory of NYCG and Nesterov 
techniques, first-order algorithms have attracted many researchers during the last 
decade and have been extended to solving different classes of problems. Nesterov in 
[llj propsed a variation of his earlier algorithms for minimizing a nonsmooth function. 
In addition to nonsmooth optimization, Nesterov algorithm has been adapted for 
constrained problems with simple enough feasible regions so that a projection on these 
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sets can be easily computed. One may refer to }15| and references therein for a more 
elaborated discussion on different adaptations of Nesterov's algorithm. The focus of 
this paper, however, is more on CG algorithm and not on first-order techniques in 
general. 

We now provide further background on conjugate gradient. The original linear 
CG has the following form: 

^j+i - _ _L^__^ (1.4a) 



(dJ)MdJ' 



d.+i = _r^-+i+(£z±l)!^d,. (1.4b) 

In the above equations r-* is Vf{x) ~ Axi' — b and — — r*'. It is possible to 
show that the number of iterations in linear CG is bounded by the dimension of the 
problem, n. For more details on linear CG, one may refer to [S] or |12| . 

Nonlinear CG was proposed by Fletcher and Reeves [3] as an adaptation of the 
above algorithm for minimizing a general nonlinear function. The general form of this 
algorithm is as follows: 

xi+i=x^"+aM^ (1.5a) 



d^'+i = -g^'+i (1.5b) 

Here, d-' is the search direction at each iteration, g-''''^ is the gradient of the 
function at (j + l)th iterate, i.e., V/(x-'"''^); and is the step size, usually determined 
by a line search. Different updating rules for /J-' give us different variants of nonlinear 
CG. The most common formulas for computing are: 



Fletcher-Reeves (1964): l3FR = ^^j-f/, 
Polak-Ribiere (1969): Ppb. = ^ — 

Hager and Zhang |6| present a complete list of all updating rules in their survey on 
nonlinear CG. The convergence of nonlinear CG is highly dependent on the line search; 
for some the exact line search is crucial. There are numerous papers devoted to the 
study of global convergence of nonlinear CG algorithms, most of which discuss variants 
of nonlinear CG that do not rely on exact line search to be globally convergent. Al- 
Baali [I] discusses the convergence of Fletcher- Reeves algorithm with exact line search. 
Gilbert and Nocedal [4] study the convergence of nonlinear CG algorithms with no 
restart and no exact line search. Dai and Yuan [2] present a nonlinear CG for which 
the standard Wolfe condition suffices. A recent variant of CG has been proposed 
by Hager and Zhang [7] that relies on a line search satisfying the Wolfe Conditions. 
Furthermore this algorithm has the advantage that every search direction is a descent 
direction, which is not necessarily the case in nonlinear CG. 

From Yuan and Stoer's perspective |16j . CG is a technique in which the search 
direction d-'^^ lies in the subspaee spanned by Sp{g^ ,d^}. In the algorithm they 
propose they compute the new search direction by minimizing a quadratic approxi- 
mation of the objective function over the mentioned subspaee. A more generalized 
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form of CG called Heavy Ball Method, was introduced by Polyak [Tl], in which x^+^ is 
+a{—g^) + f3{x.^ — x^^^). He proved a geometric progression rate for this algorithm 
when a and /S belong to a specific range. In the same work, Polyak reviews the CG 
method; our simplest form of CGSO, which is given in section [2TT1 coincides with his 
presentation of CG. 

The remainder of this paper is devoted to CGSO and its properties. In section 
[2] we describe the algorithm. Our main result on the convergence of the algorithm is 
presented in section [^751 The implementation of the algorithm is discussed in section 
[21 furthermore the results and comparison of CGSO with CG are presented in this 
section. 

2. CGSO. In this section we present CGSO for solving the problem 

mill / (x) , (2.1) 

where /(x) is a strictly convex function characterized by parameters L and /. We 
follow the standard notation throughout this paper. (., .) represents the inner product 
of two vectors in proper dimension, and ||.|| stands for the 2- norm of a vector unless 
otherwise is stated. Bold lower case characters and upper case characters are used for 
vectors and matrices respectively; and their superscript states the iteration count. 

2.1. The Algorithm. Let g(x) denote the derivative of /(x); CGSO, in its 
preliminary form, is as follows: 

• x*^ = arbitrary; 

• for j = 1,2,... 

- x^+i = x^' + a^g(xJ') + /3^d^ where 
d-' = x^ — x-'^^ and 

a\l3^ = argmin^^^ /(x-' + ag{-x.^) + /3d^) 

As we shall see, some modifications are necessary to achieve the desired com- 
plexity bound. The above algorithm is certainly a generalization of nonlinear CG, 
since each search direction is a combination of previous gradients. Notice that the 
above algorithm is a descent method (i.e. f{x!>'^^) < f{x!>)). Furthermore, the above 
algorithm performs at least as well as steepest descent in each iteration; hence it con- 
verges to a stationary point, which, for the class of convex problems, coincides with 
the minimizer of the function. 

For convenience, let us represent (/(x^) by g^. and let w/(x) denote the residual 
of the function, i.e. /(x) — /(x*). By strong convexity of the function, we get the 
following properties for the above algorithm: 

(a) /(x^+i)</(x^)-^||g^f 

(b) (g^x*-x^) </(x*)-/(x^) 

(c) «/(xO) = /(xO)-/* >i(x*-x0)2 

Property (jsj) follows from (|1.3p and the fact that /(x^+^) < /(x-' — -^g-'). Property 
(|b|) is true by convexity of the function, and property (jcj) is a direct derivation from 
inequality (|1.2p . We are now ready to present the main lemma on the complexity 
bound of the algorithm. 

Lemma 2.1. Suppose m > 8p^J^ and 

^^^"""'i"^^""^^ +EA^(g^-^--")<0, (2.2) 
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and 



(2.3) 



are satisfied, where p is a constant > 1, and 

= 



/(x^)-/(x^-+i) 

J||2 



V lie- 

. Then the residual of the function is divided in half after m iterations; i.e. Vf{x"^) < 
iz;,(xO). 

Proof. Our proof is an extension of the proof in section 7.3 in [9]. Suppose by 
contradiction that m > 8p^J^ , (|2.2p and ()2.3p are satisfied; but w/(x'") > . 
By definition of A-' , 

/(x^-+i)=/(x^)-(A^y iig^f, 

hence 

Summing these inequalities over j = 0, • • ■ , m — 1, we get: 

rn— 1 

0<z;/(x™)=z;/(x«)-^(AO'||g^f, 

j=o 

or equivalently, 

m — 1 

E (A^Tllg^f <t'/(x°). (2.4) 

By convexity of the function we have, 

(g^x* - x^) < /(x*) + /(x-') = -«/(x-'"), 

and so 

(g^x* - x°) - (g^x^' - x") < -Vf{x^) < -v/(x'") < 

Let's consider the weighted sum of aU the above inequahties for j = 0, . . . m — 1 
with weights A^ 's to get: 



-«/(x°) 



/ ^ A^g^ X* - xO \ - ^ A^" (g^ x^" - x") < 

\ 3=0 I J=0 

which can be rearranged to the foUowing form, 



-«/(x°) 



/m-1 

E ^'g': 

\3=0 



x-xo <-!^ 



E-^' I + E^'(g''^'-^o)• 
, J=0 / j=0 
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Equivalently we can rewrite the above inequality as: 

«/(x°) 



/ m-l 

,i=0 



/(x-)-/(x», (^^,,^^,^^_^_^^^ 



Using inequality (|2.2p along with the facts that /(x*) < /(x-') and A-* > for all 
j, we get: 



/ m — 1 



^A^g^x*-x") <- 



«/(x") 



By the Cauchy-Schwarz inequality we have 



E ^'g' 



l|x* -x°|| ^ (^E A^g^x* -x"y < 



hence 



E ^'g' 



i=0 

By property (jcj) we have 



II * oil «/(x°) 
X* - x" > — — - 



.0\ ra-l 

E^^ 

J=0 



x* -x'^ < 



2«/(x") 



Furthermore, by inequalities (|2.3p and (|2.4p we get: 



m— 1 



E ^'g' 



Replacing inequalities (|2.7I) and (|2.8p in inequality (j2.6p . we get 



m — 1 



Notice that by definition of A and property > J for ah j, so 



m — 1 



J=0 



Using this fact in inequality (|2.9p . we get 



r— /2«/(x") t;/(xO) / [T 



(2.5) 



(2.6) 



(2.7) 



(2.8) 



(2.9) 



Conjugate Gradient with Subspaee Optimization 



7 



therefore 



m < 



(2.10) 



which contradicts our assumption on the value of m. □ 

Lemma 12.11 shows that under conditions (12.21) and (12.31) , the residual of the func- 



tion is divided in half every m = ^iy j) iterations. For the next sequence of m 

iterations, a further reduction of ^ is achieved provided (|2.2p and (|2.3p hold, with 
substituted in place of x°. Hence by letting x™ be the new x" and repeating the 

same algorithm, we can find the e-optimal solution in [log2 j~\ ^p^J~^ iterations. 
Of course, m is not known to our algorithm, a point to which we return in the next 
section. 

2.2. Restarting. As mentioned before, the algorithm should use first 0, then 
TO, then 2m and so on when checking (|2.2p and (|2.3p in order to achieve the desired 
complexity bound. We use the term "restarting" to refer to the process of replacing 
x° in inequalities (|2.2p and ()2.3p with x™ for some to > 0. Notice that this process 
does not change any of the iterates, and it only changes the interval of indices in 
which we check inequalities (j2.2[) and (j2.3|) . 

If we know parameters of the function, i.e. L and /, we can compute to directly; 
hence we would be able to determine exactly when we need to restart the algorithm. 
In many cases, however, finding the parameters of the function is a nontrivial problem 
itself. In order to have an algorithm which does not rely on any prior knowledge about 
the function, we propose the following technique. 

Since 2^ < m < 2^"+^ for some p, by restarting the algorithm every 2^+^ iterations, 
we are guaranteed that the residual of the function is divided in half between every two 

consecutive restarts which contains at most 2to = [IGpy^] iterations, where the last 
statement follows from the fact that 2^'+^ < 2to. Since the exact value of p is usually 
unknown, we repeat the above procedure for every value of p € {Pi, . . . , Pu}- In the 
section on implementation of the algorithm we will discuss the range of p we used, 
however note that P^ does not need to exceed [logj j~\ , where j is the current iteration 
count. This is because inequalities p.2p and p.3p are the same for all p > [log2 j]. 
We can now state the CGSO algorithm in a more complete form: 



x" = arbitrary; p : given; 



For j = 1,2,, 



argmin/(a; 
Let = 



where = x^ + Sp {g-' , d^'} 



— 2 



IIS- 

For p^ Pi,..., riogjj] 

If j + 1 = kp2P for some integer kp 
Let Tp = {kp — 1) 2P; check 

(/(x^ + ^)-/(x'-p)) f J 



and 



(el., a*) + ELr, A» (g^ X' - x--^) < 



If any of the above inequalities fails, take the "correction step" . 
(which will be defined in the subsequent section) 
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2.3. Correction Step. Let us refer to the set of iterates between two consec- 
utive restart as a "block" of iterates; in other words for any p, the set of iterates 
xO,x\. 



^.2(2'') 1 ^Yie second 



jX^" ^ is the first block of size 2^, x^" , x^" , 
block of size 2^, and so on. At the end of each block we check inequalities (|2.2p and 



2.3p . If they are satisfied and 2^ > 



8p 



, then by Lemma 12.11 we know that the 



residual of the function is divided in half; however if any of these inequalities fails, 
then as mentioned in the previous section we need to take "correction step" for the 
next block of iterates. The correction step is basically computing the next block of 
iterates in a way that satisfaction of inequalities ()2.2|) and (|2.3|) is guaranteed at the 
end of this block. Then the correction step is omitted in the subsequent blocks until 
the inequalities are violated again. 

Recall that in an ordinary step, the new iterate, x^+^ is calculated by a 2- 
dimensional search on the plane passing through x^ , and parallel to the 2-dimensional 
subspace spanned by g-' , and . Suppose at least one of the inequalities (|2.2p and 
()2.3p is violated for fcth block of p; i.e. for the block of iterates x''p, . . . ,x''p+^''^^ 
where rp = {k — 1)2^. Then for the next block we search for the new iterate x-'^^ on 



the space of x-^ + 5p { g-' , d-' , , x-' - si^p } where = J2i=rp • 

Finding the new iterate x^+^ through a search on the space that in addition to g^ 
and d^ includes q^ and x-' —x^p is what we referred to as "correction step" . Notice that 
for each p with the violated constraints we increase the dimension of the search space 
by 2. However, the dimension of the search space never exceeds 0{Pu) = 2 + 2[log2.?], 
which happens to be the case when the inequalities are violated for all possible values 
of p. 

It is quite easy to see that inequalities (|2.2|) and p.3|) are satisfied for the (fc + l)th 
block of p when we take the correction step throughout it. By KKT condition, we 
have ^g^ , x^ — x''p) = for all j in this block. Using this, along with the fact that 
/(x^) < /(x''p) and non- negativity of A-' for all j, we derive (|2.2p . Similarly one can 
argue that by KKT (g^q^^^) = for aU j, hence 



P+2P- 

E 



yg' 



rp+2P-l 



in 



which means inequality (|2.3p is satisfied. After finding the iterates of one block 
through a correction step, the algorithm switches back to taking a regular step until 
the next failure of the inequalities. We can now present the algorithm in its entirety. 
Algorithm 1. 



x'' = arbitrary] p : given; S ~ 
for j = 1,2,... 

x-'^^ = argmin/(x) where = x> + Sp {g-' , d^ , Upgsq^, Upgsx^ - x*"" } 

for p = P(,. . . , \\og2j^ 

it j + 1 = kp2P for some integer hp 
Let rp = {kp - 1) 2^ 
if pe S* 

S = S\{p} 
else (i.e. ifp^S), check 
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(/(x^ + ^)-/(x'-p)) 



(el., a*) + EL., A' (g% X* - x'>) < 



j/ any of the above inequalities fails, S = S \J {p} 



end 



end 



end 

end 

Theorem 2.2. Suppose m > 



, and xP is a sequence generated by algo- 

rithm[J\for solving problem (|2.ip .- then -(;/(x'^"+'*)™) < it;/(x"™). 

Proof. Let p < P he the integer for which 2^~^ < m < 2^; and let Sp stand for 
2^. Using algorithm [U we are guaranteed that for at least one of any two consecutive 
blocks of size Sp inequalities (|2.2p and (|2.3p are satisfied. The size of this block is 

Sp > m> 8pWj and hence by lemma [01 we have 



Since 2sp < 4m, so /(x 



n??i+4m \ 



</(x 



.nm+2sf 



); hence 



nm+4m 



) < w/(x"™+""*). 



(2.11) 



(2.12) 



p. lip and p.l2p gives us the result we wanted to show. □ 
In our experiment with the algorithm, however, we only include the x-' — x''p term 
in the correction step. We will discuss this and few other remarks on the implemen- 
tation of the algorithm in the following section. 

3. Implementation. 

3.1. Solving Each Iteration. Obviously the most important part in CGSO is 
solving the subspaee optimization problem in each iteration. In our implementation 
of the algorithm, we used Newton's method for this task unless it fails to rapidly 
converge to the optimum. In the case of failure of Newton's algorithm, the ellipsoid 
method carries out the task of solving the optimization problem. In other words we 
assign an upper bound to the number of iterations that Newton's method may take, 
and if it fails to converge within the given number of iterations the algorithm switches 
to the ellipsoid method for finding the next iterate. 

Recall that at (j + l)th iterate, we search for x^+-'^ in the space of vectors x = 
x^ + ag^ + pd^ + aQ + hi?, where Q G M"^!'^! is the matrix formed by columns 
for all p G S; R is the matrix of the same dimension with columns x-' — x''^ for all 
p e 5; G M, and a, b G M''^' are coefficients that we want to find. Let y denote 
the variable of the subspaee optimization problem, i.e., y = [a, /?, a*, b*i 



in addition 



let B = [g^, d-', Q, i?] and A' = 2 + 2 |S'|. We can now state the formal presentation 
of the subspaee optimization problem. 



mm/(x^+By) (3.1) 

As mentioned above, we solve problem p.ip with Newton's method. Letting 
/(y) = /(x-' + By) and using chain rule we get the following formulas for the gradient 
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and Hessian of each Newton's iteration, 

V/(y) = i?*V/(x) (3.2) 

VV» = B*V2/(x)S (3.3) 

Notice that some second order information of the function comes into play in 
equation p.3p : and we beheve that this is one of the reasons that CGSO stands out 
in practice. We compute V/(x) and V^/(x) directly when /(x) is simple enough. 
For more complicated functions we use automatic differentiation (AD) in backward 
mode to compute V/(x) and V^/(x)i3. Let B^'^^ denote kth column of matrix B. 
Backward AD enables us to keep the computational cost of V/(x) within a constant 
factor of the objective function evaluation cost; and the cost of computing V^/(x)i?('^^ 
within a constant factor of the computational cost of gradient evaluation. The storage 
space required in backward AD, however, is more than the required storage in forward 
AD; and in worst case it can be proportional to the number of operations required 
for computing /(x). In spite of that, we are able to keep the storage required by 
backward AD for the class of problems we studied in 0(s), where s is the space 
required to compute /(x). Details on our test problems are presented in section [321 
For more information on AD, one may refer to |12j . 

In addition to the storage required by AD, we need to store x-', and matrix B; 
we also need to update and store k^p , X]i=rp -^'i J2i=rp (sN^* ~ x'"p), X]i=rp ^^S^^ 

J2i=r (•^*)^ for p e {Pi, . . . ,log2 j}. In our experiment with CGSO we take 

Pi to be 4. We find that K is equal to 2 in every case except one, in which it 
reaches the value of 3. Hence the required storage space for the above elements is in 
0{n max{A", log2 j}). 

A difficulty we need to overcome in solving a problem with CGSO, especially when 
we are trying to achieve high accuracy, is round-off error. In particular /(x^)— /(x^+^), 
which is required for computing , gets more and more inaccurate as the iterates 
approach the optimum. To overcome this, we took advantage of the second order 
Taylor series expansion. We have a subroutine that analyze the absolute error of 
/(x-^) — /(x-'+^) computed directly and through Taylor series, i.e. /(x-') — /(x-'+^) w 
-(V/(xJ)*(xJ+i - x^) + i(xJ+i - xJ)* V2/(x^')(xJ+i - x^); the one with smaller error 
is accepted. In some cases the error analysis is not easy, hence a heuristic is used 
to choose the preferred formula. Computing the difference of two objective values is 
appeared in inequality p.2p as well, in /(x™~^) — / (x°); the same subroutine is used 
to compute this term. 

3.2. Numerical Results. We have tested our algorithm on the following classes 
of problem: 

• /i(x) = - E" 1 log (a*x - b,) 

• /2(x) = -c*x - log (dot {C - Diag(x))) 

• /3(x) ~ Ei=i (^iX — hi) where d is a given even integer. 

Functions /i and /2 are log-barrier functions, and /a is an approximation to the 
infinity norm of the vector Ax — b. Note that we need to restrict the degree, d, to 
even numbers to save the convexity of /a. 

All the codes for CGSO, Ellipsoid method, and automatic differentiation are 
written in MATLAB. The Hessian of /i and /a can be derived easily in closed form. 
The Hessian of /2 was obtained by applying backward mode AD by hand to a program 
that computes /2. 
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The stopping criteria we used for both CGSO and its subproblems are ||V/(x)|| < 



and 



V/(x) 



< e^r, respectively. In our implementation, ejv at iteration j is 



Too^ ' however, if the subproblem is solved by Newton's method, the obtained 
solution is usually more accurate due to the fast local convergence of the New- 
ton's method. Notice that for /i and /2 we have some hidden constraints, namely 
a*x — bj > in /i and C — Diag(x) ;^ in /2. Therefore CGSO switches to ellipsoid 
method if Newton's method does not converge in 15 iterations or if the iterates get 
infeasible. 

In [7], Hager and Zhang compared their variant of CG with L-BFGS and PRP+, 
and established the superior performance of their variant of CG. Hence, we compare 
our algorithm with their variant of CG, represented by CGhz in Table 13.11 that 
summarizes this comparison. The top portion of the table explains the instances that 
we generated randomly. Parameters m, n, d, and e are defined as above; mj for /2 
represents c = m^e where e is the all ones vector. The point of this parameter is that 
larger values of rrif force C — Diag(x*) for the optimizer x* closer to the boundary of 
the feasible region, hence the problem become more ill-conditioned. 

Let A be the to by n matrix formed by a* as its rows; "ds." indicates the density, 
the percentage of nonzero entries of A for /i and /s, and density of C for /2. "It. count" 
for both CGhz and CGSO indicates the total number of major iteration the algorithm 
takes before convergence. "LS. count" is the total iteration performed by the line 
search routine in CGhz', for each iterate of LS we need one gradient evaluation. In 
CGSO, one gradient and one Hessian evaluation is required for each Newton's iterate; 
and one gradient is calculated in each iteration of the ellipsoid algorithm. As table 13.11 
shows, the number of iterations CGSO takes is less than CGhz in all instances, and 
the number of iterations for solving subproblems is significantly less than the number 
of line search iterations; especially for instances 3 and 4, this difference is considerable. 
The last row of the table implies that no correction step was taken in any of these 
instances, except one. Actually the purpose of "Ins. 7" is to show that correction 
step might be required for some problems. In this instance A is a square matrix of 
size 50 X 50 with 0(10'^) condition number. The correction step is taken throughout 
7 blocks corresponding to Pi, for which the size of the subproblems reaches 3. This, 
however, is consistent with the theory because the larger the condition number is, the 

larger j is; therefore 2^' may not be a good approximation of 

3.3. Parameter p. Recall that p is a parameter of CGSO required for checking 
inequality (|2.3|) . As mentioned before, we do not check inequality (|2.3p in our imple- 
mentation of the algorithm; instead we gather the values of p and study their trend. 
Table 13.21 summarizes the maximum value p reached in each instance over all values 
of p. Figure 13.11 and 13.21 depict p for Ins. 1 and Ins. 3, respectively. We chose these 
instances because they have higher pmax and iteration count in each category. Our 
observation suggests that a reasonably small bound for p should suffice. Notice that 
the plotted values imply that p reaches its maximum for some p and then it decreases 
for higher values of p, so it does not grow with p. Furthermore, as we get closer to 
the optimum parameter p gets closer to 1. This is a common pattern for all test 
problems we had. Instead of fixing p one may adaptively update this parameter. In 
other words, we can assign a value to p and if inequality p.3p fails for more than a 
certain number of blocks, then we increase p, and as the algorithm progresses towards 
the optimum we can decrease p. 
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Table 3.1 
Comparison of CGSO and CGuz 





/i 


/2 


h 


Ins. 1 


Ins. 2 


Ins. 3 


Ins. 4 


Ins. 5 


Ins. 6 


Ins. 7 


m 


6000 


6000 






1500 


2500 


50 


n 


onnn 
ZUUU 


onnn 
zUUU 


OUU 


1 nnn 
iUUU 


oUUU 


OUUU 


OU 


ruf 






100 


10 








d 










6 


4 


4 


ds. 


1 


0.5 


0.012 


0.006 


0.5 


0.5 


1 


e 


lc-8 


le-8 


le-3 


le-8 


le-18 


le-18 


lc-8 


CGhz 


It. count 


404 


357 


3059 


928 


225 


203 


3442 


LS. count 


2999 


2535 


92682 


14596 


6220 


7333 


53474 


CGSO 


It. count 


261 


213 


1080 


419 


148 


136 


658 


Newton 


402 


323 


1884 


600 


545 





1055 


ellipsoid 


394 


422 


6195 


1641 


410 








Corr. 
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Table 3.2 
Parameter p 





Ins. 1 


Ins. 2 


Ins. 3 


Ins. 4 


Ins. 5 


Ins. 6 


Ins. 7 


Pmax 


1.9848 


1.9803 


3.2511 


2.144 


1.016 


1.0281 


1.9430 



4. Conclusion. Wc presented CGSO for solving unconstrained strongly convex 
functions. CGSO is a variant of CG, since the update step in each iteration is a combi- 
nation of previous gradients, and it reduces to linear CG when applied to a quadratic 
function. The coefficients of previous gradients, however, do not follow an updating 
rule and are found by a subspace optimization subproblem. We have discussed that 
these subproblems are mostly two dimensional; in some cases the dimension of the 
subproblems may reach slightly higher values, but it certainly never exceeds O (log j), 
where j is the iteration count. We have also shown that CGSO benefits from the 

optimal complexity bound of O ^y^log (7)^ in theory. CGSO does not depend on 
any prior information about the function and can easily be implemented. In practice, 
it outperforms the variant of CG proposed by Hager and Zhang [7] which is known to 
be stronger than other techniques such as L-BFGS or PRP_|-. Practical efficiency of 
CGSO can be improved by incorporating some second order information of the func- 
tion in solving the subproblems with Newton's method as discussed in the previous 
section. 
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-^^p=4 
- e - p-5 
-□- p-6 

+ p=7 



Fig. 3.1. Parameter p for Ins. 1 




Fig. 3.2. Parameter p for Ins. 3 
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